Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval

The recent era has witnessed exponential growth in the production of multimedia data, which has prompted the exploration and expansion of domains that will have a considerable impact on human society in the near future. One of the domains explored in this article is content-based image retrieval (CBIR), in which images are mostly encoded using hand-crafted approaches that employ different descriptors and their fusions. Although these approaches have yielded outstanding results, their performance in terms of the semantic gap, computational cost, and appropriate fusion for a given problem domain is still debatable. In this article, a novel CBIR method is proposed that is based on a transfer learning-based visual geometry group (VGG-19) model, a genetic algorithm (GA), and an extreme learning machine (ELM) classifier. In the proposed method, instead of using hand-crafted feature extraction approaches, features are extracted automatically using a transfer learning-based VGG-19 model to consider both local and global information of an image for robust image retrieval. As deep features are of high dimension, the proposed method reduces the computational expense by passing the extracted features through the GA, which returns a reduced set of optimal features. For image classification, an extreme learning machine classifier is incorporated, which is much simpler in terms of parameter tuning and learning time than other traditional classifiers. The performance of the proposed method is evaluated on five datasets, on which it outperforms state-of-the-art image retrieval methods in terms of the evaluation metrics. A statistical analysis using the nonparametric Wilcoxon matched-pairs signed-rank test also shows that the improvement is significant.


Introduction
People nowadays love to capture and share their life events, e.g., via social media platforms, which leads to extensive growth of multimedia data. This triggers the need for certain [...] convergence and generalization as compared to alternative classification methods such as support vector machine (SVM), Boltzmann machine (BM), restricted Boltzmann machine (RBM), deep belief network (DBN), and Hopfield neural network (HNN) [6,7].
The main contributions of the proposed method are as follows:
i. An optimized feature set is constructed by applying a genetic algorithm over the deep features extracted from the VGG-19 architecture, both for robust CBIR and to reduce the computational expense of the proposed method.
ii. The semantic gap issue of CBIR is reduced between the extracted features and high-level semantic concepts of the images.
iii. For efficient, effective learning and convergence, an ELM classifier is utilized for the proposed method.
iv. An extensive experimental analysis over five datasets (namely Wang-A, Wang-B, OT Scene, Wang 10k, and Caltech 256) is conducted to examine the scalability of the proposed method as compared with state-of-the-art CBIR methods.
The rest of the paper is organized as follows: Section 2 highlights some of the existing related work. Section 3 presents the proposed methodology in detail. Experimental discussion and achieved results are provided in section 4. Section 5 presents the conclusion and future direction of the research.

Related works
Fadaei et al. [8] address issues such as noise and image translation by integrating several wavelet and curvelet features along with the dominant color descriptor (DCD). First, the HSV color space is considered because of its ability to differentiate chromatic and achromatic components precisely. Extracting DCD features from the HSV color space resulted in coarse partitions, so, to obtain even partitions, pixels are classified based on similar probability. After that, the corresponding centers are defined based on their distances, not their partitions, to yield better accuracy. Meanwhile, a combination of the Frobenius norm with the wavelet and curvelet transforms is proposed for texture representation. Combining the three feature sets using a particle swarm optimization algorithm showed better performance than competitor methods, albeit with a longer running time. Images should be compared based not only on their regions but also on their nature, because accuracy suffers if only the regions are considered. Considering this, image retrieval based on location-independent ROI is presented by Raghuwanshi et al. [9]. This approach segments an image into textured and non-textured regions. The tetrolet transform is used to highlight textured regions, and for non-textured regions moment invariants in combination with edge features are used. Varying block sizes are tested to find optimal blocks for segmentation. A larger block size resulted in overlapping regions and increased segmentation time; therefore, an 8×8 block is suggested. A similarity count is added to give a higher rank to images having more similar regions, which reduces the number of comparisons and improves precision and recall. Another image representation method based on iterative DCT and sparse representation is presented in [10]. The HSI and CIE-LAB color spaces are analyzed because of their uniform color perception, which is considered to be related to human perception.
Sparse representation is combined with several available acceleration techniques like DALM, PALM, etc. to investigate retrieval results. The proposed method's performance is evaluated by varying recall probability and averaged modified normalized retrieval rank. Experimental analysis shows a remarkable reduction in storage requirements and vector size.
To identify prominent objects in an image with high precision, Rehan et al. [11] proposed a novel image representation method based on the color histogram and the bandelets transform. The proposed method highlights the most informative texture regions and uses artificial neural networks to overcome incorrect geometric classification. After the semantic class of an image is determined by an SVM, an inverted index mechanism, as used by Google for text-based search, is also incorporated for fast image retrieval. Experimental evaluation shows promising results without requiring the user intervention that many relevance-feedback-based CBIR systems depend on. If a machine vision system can identify salient objects in an image in the early stages of recognition, it will be possible not only to generate proper object detection windows for further processing but also to reduce computational costs considerably. Considering this, a method for salient object subitizing (SOS) combined with CNN is presented by Zhang et al. [12]. For training the CNN-based SOS model, 20k synthetic images are generated by varying the number of salient objects and the background images. This method successfully suppresses false object detections and results in better average precision for images having 3 dominant objects. Hussain et al. [13] present an improved pre-processing technique using the quaternion transform to highlight salient regions of the image, thereby improving the retrieval rate.
Anandh et al. [36] presented a hybrid framework comprising local features for CBIR. The framework uses color auto-correlogram, Gabor wavelet, and wavelet transform for extracting color, texture, and shape simultaneously. For deriving texture, six orientations and four scales are used. This method used SVM as a classifier and Manhattan distance as a similarity measure for image retrieval. In terms of performance, the combination of hybrid features resulted in improved performance as compared to the individual feature representation methods for image retrieval. Dubey et al. [37] have come up with a novel descriptor based on adder and decoder concepts. Local binary patterns (LBP) of three channels are combined with adder and decoder to yield outputs of 3 input channels, 4 adder channels, and 8 decoder channels. In terms of performance, decoder channels highlight color texture information better as compared to adders and input channels. A higher dimension of the feature vector is one of the shortcomings of decoder-based LBP. A robust feature representation model based on local texton XOR patterns (LTxXORP) is presented by Bala et al [38]. The proposed model divides the V space of HSV color space into sub-blocks of 2×2. Texton images are generated by applying 7 texton shapes on each sub-block. Afterward, an XOR operation is performed between the center pixels and neighboring pixels of the resultant image. Histograms of HSV color spaces and the texton XOR image are concatenated to get the final feature vector. The experimental analysis highlights the robustness of the proposed model as compared to other LBP-based methods. Another novel method based on bag-of-words (BoW) is presented by Sarwar et al. [39] to address the semantic gap issue that occurred in a CBIR system. 
The proposed method builds a dictionary that incorporates complementary features from both the LBPV and LIOP descriptors by applying the density-based spatial clustering of applications with noise (DBSCAN) method. LBPV features overcome the loss of global texture information faced by LBP by adding variance as a weight to obtain the feature vector, whereas LIOP is used to preserve both the local and global order of pixel intensities. The dimensions of the resulting feature vector are reduced using PCA, and classification is performed using SVM. The experimental analysis shows better recall with a small visual dictionary and better precision with a large dictionary. In most studies, features are represented using the output of the last layers of a single CNN without quantization, so the intermediate convolution layers remain neglected. To address this, Alzu'bi et al. [40] proposed a bilinear approach named CRB-CNN that models two CNNs in parallel, i.e., VGG-16 and VGG-m, to extract features from intermediate convolution layers in an unsupervised way, which resulted in low-dimensional but compact and highly discriminative features of vector length 16. This method uses the first 15 and 30 layers of VGG-m and VGG-16, respectively, and replaces the fully connected layers with three new layers, i.e., root pooling, sqrt, and L2 normalization. The model reduces the image features to several compact dimensions, i.e., 512, 128, 64, 32, and 16. The experimental analysis shows the best retrieval accuracy over all dimensions when Euclidean distance is used as the similarity measure. In the case of Manhattan distance, accuracy tends to improve when the vector size is set to 64 and starts to degrade when the size is reduced to 32 or 16. Quantization as a pre-processing step is also suggested in [41] to reduce dimensions.
Mary et al. [42] presented a hybrid feature selection method based on a genetic algorithm. The feature set is a merger of color moments, entropy, energy, homogeneity, contrast, and feature descriptor. A backpropagation neural network is used as a feature selection algorithm as well as for classification. 10 best features are selected from a set of 26 features. The approach's performance is judged considering many similarity measures but the modified normalized retrieval rank evaluated the system accurately. Using CNN for feature extraction is also suggested by Shah et al. [43] based on the precision achieved against competitive methods. Bai et al. [44] have come up with an optimized version of AlexNet named (OANIR). The proposed improvements of this method are a combination of max and average pooling, the use of maxout function as the activation function for fully connected layers, and the addition of a hidden layer for binary code representation. At a hidden layer, a binary code function is used to limit output between 0 and 1. Extracted and queried binary codes are then judged based on hamming distance. OANIR has outperformed the original AlexNet in terms of precision and mAP even for large-scale image datasets. Can the same binary code be used for retrieval and compression to efficiently utilize storage? To answer this Zhang et al. [45] have studied deep networks for image compression. Two deep networks are trained. First, for representation of the image in compressed bitstream form, and second for extracting features. Both the trained networks are then combined using triplets of images. The proposed method outperformed in terms of JPEG compression and achieved a compression ratio of 5.3 for 32×32 thumbnails. The performance of several classifiers i.e. SVM, LSSVM, NN, ELM, and kernel ELM for the object recognition domain is evaluated by Zhang et al. [46]. 
The deep features are extracted using a CNN having 5 convolution layers and 3 fully connected layers, pre-trained on the ImageNet dataset. The outputs of layers 6 and 7 are used as inputs for classification. The recognition accuracies are tested under three setups, i.e., single domain, cross-domain using source, and cross-domain using source and target. In all three setups, kernelized ELM shows the best performance among the compared classifiers. Recent advances in CBIR are reviewed in [47]. The study highlights key challenges in the generic modules of the CBIR framework and suggests a variety of representative strategies and methods to overcome the recognized challenges. Guo et al. [48] describe various deep learning approaches comprehensively and summarize the significant issues related to the design and training of deep networks. The study provides insight into the scope and compares the performance of deep networks on commonly used datasets. The details of competitive CBIR methods are presented in Table 1.

The methodology of the proposed model
This section discusses in detail the methodology of the proposed method, as presented in Fig 2. The three primary steps of the proposed methodology are a) deep feature learning through transfer learning and VGG-19, b) selection of optimal features, and c) image classification using ELM. The similarity between a query image and the training images is judged using the Canberra distance. A detailed description of each of these steps is presented in subsequent sections.

Features learning
Machine learning algorithms work by mapping the relationship between input and output data based on learned knowledge. If the input/training data share the same feature space or distribution as the output/testing data, the predictions are accurate. Conversely, if they belong to different feature spaces, the predicted outcomes are inaccurate, which degrades the overall performance of the system. As mentioned earlier, with the exponential growth of image repositories, utilizing an existing dataset that is not identical but close to the target domain is an efficient approach. Hence, in the proposed method, transfer learning (TL) is employed to save time and resources by fine-tuning or reusing a pre-trained network. TL is defined in [49] as follows: given a source domain $S$ and learning task $L_S$, and a target domain $T$ and learning task $L_T$, transfer learning improves the learning of the target predictive function $f_T(\cdot)$ in $T$ by utilizing the knowledge in $S$ and $L_S$, where $S \neq T$ and $L_S \neq L_T$. When a pre-trained network is utilized, the parameters of the initial layers are used as they are rather than being initialized randomly, which enhances the generalization ability of the model and accelerates the learning process.
In the proposed method, the VGG-19 architecture (discussed in subsequent sections) is retrained on our selected datasets. As this network accepts images of a specific size, the images are preprocessed to make them compatible with the network's initial layers. After the network is trained, the output of fully connected layer 7 is taken as the feature vector, which has dimension 4096×1.

CNN architecture for the proposed method.
A convolutional neural network (CNN) starts with N training images, which are passed through several convolution layers followed by pooling layers to the final fully connected layers. In convolution layers, features are extracted by convolving filters f of size m×m with image I at all spatial locations. A linear convolution operation outputs a feature map that captures distinct details and is smaller in size than the original image. Mathematically, a feature value in the f-th feature map of layer l at location (x, y) is expressed as:
$$z^{l}_{x,y,f} = (\mathbf{w}^{l}_{f})^{T}\,\mathbf{x}^{l}_{x,y} + b^{l}_{f}$$
where $\mathbf{x}$, $b$, and $\mathbf{w}$ represent the input patch, bias, and weight vector, respectively. To detect nonlinear features, activation functions such as ReLU [50], sigmoid, and tanh are commonly used to add non-linearity to the CNN. The ReLU activation function is expressed as:
$$a^{l}_{x,y,f} = \max\left(0, z^{l}_{x,y,f}\right)$$
Early layers of a CNN capture local details, i.e., edges, curves, textures, etc., while deeper layers can reach a semantic-level understanding of images, as humans do. Upper layers of CNN are also referred to by [4] as good descriptors. The number of kernels, their size, and the stride factor are some of the parameters of convolution layers.
Stacked feature maps are then passed to pooling layers, which reside between successive convolution layers, to reduce the overall computational burden by reducing the number of trainable parameters. In other words, pooling layers reduce the dimension of feature maps by applying a downsampling operation, thereby achieving translation invariance. Max, min, and average pooling are some variants of this operation. For a pooling region of size n×n, max-pooling can be mathematically expressed as:
$$MX^{l}_{j} = \max_{i \in R_{j}} a^{l}_{i}$$
where $MX^{l}_{j}$ represents the output of the max-pooling operation at layer l using the j-th n×n pooling region $R_{j}$, and $a^{l}_{i}$ are the activations inside that region.
After passing through a series of multiple convolutions and pooling layers, the resultant output is then flattened into a single-dimensional vector which determines the probability of possible class labels. All the neurons of the previous layer are connected to each neuron of a fully connected layer to predict semantic association to a class. A loss function is then used to measure the prediction error. Once an error is calculated, results are backpropagated to update weights and biases to reduce misclassification.
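The convolution, ReLU, and max-pooling steps described above can be traced on a toy input. The following is a minimal sketch in plain Python, not the network used in the paper: the 5×5 image and the 2×2 filter are hypothetical values chosen only so that the arithmetic is easy to follow, and a single channel with stride 1 and no padding is assumed.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide an m x m kernel over the image (stride 1, no padding).

    Each output value is z = w . x + b for the patch x under the kernel.
    """
    m = len(kernel)
    rows = len(image) - m + 1
    cols = len(image[0]) - m + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            z = bias
            for i in range(m):
                for j in range(m):
                    z += kernel[i][j] * image[r + i][c + j]
            row.append(z)
        out.append(row)
    return out


def relu(fmap):
    """Element-wise max(0, z)."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, n):
    """Non-overlapping n x n max pooling (stride n).

    Trailing rows or columns that do not fill a whole region are dropped.
    """
    rows = len(fmap) // n
    cols = len(fmap[0]) // n
    return [
        [max(fmap[r * n + i][c * n + j] for i in range(n) for j in range(n))
         for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 5x5 single-channel image and 2x2 filter (illustration only).
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 2, 1, 0, 1],
]
kernel = [[1, 0], [0, -1]]

fmap = conv2d_valid(image, kernel)   # 4x4 feature map, smaller than the input
activated = relu(fmap)               # negative responses are clipped to 0
pooled = max_pool(activated, 2)      # 2x2 map after 2x2 pooling
print(pooled)
```

The feature map shrinks from 5×5 to 4×4 after the convolution and to 2×2 after pooling, which is the dimension reduction the text refers to.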

VGG-19 for the proposed method.
The VGG architecture was presented by the Visual Geometry Group [51] in 2014. Two variants were introduced, i.e., 16 and 19, based on the depth of layers. Because of fewer parameters, deeper layers, uniform architecture, and small convolution filters as compared to AlexNet, the proposed method uses the VGG-19 architecture. It comprises 16 convolution layers and 3 fully connected layers (Fig 3). It uses 3×3 filters in all convolution layers to learn as many complex features as possible and doubles the number of filters after pooling layers, increasing depth. Color images of size 224×224 are first pre-processed by subtracting the mean RGB values and then forwarded to convolution layers having a stride and padding of 1 pixel, which retains the spatial dimensions. The spatial dimensions are halved by each of 5 max-pooling layers (filter size 2, stride 2, no padding), which are placed at intervals between convolution layers. After the convolution and pooling blocks, 3 fully connected layers are used together with dropout, which discards activations with a probability of 50%; the first two layers contain 4096 features and the last layer contains 1000 features. VGG uses the ReLU activation function for non-linearity and is trained by a mini-batch stochastic gradient descent algorithm.
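The layer arithmetic above can be checked with a short shape trace. This is a sketch that only tracks tensor sizes; the grouping of the 16 convolution layers into blocks of 2, 2, 4, 4, and 4 layers with 64 to 512 filters is taken from the standard VGG-19 configuration and is an assumption here, since the text does not list it.

```python
# (number of conv layers, number of filters) per block; assumed from the
# standard VGG-19 configuration, each block is followed by one max-pool layer.
BLOCKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]


def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after max pooling with no padding."""
    return (size - kernel) // stride + 1


def trace_vgg19(size=224):
    """Return the conv-layer count and the (H, W, C) shape after each block."""
    shapes = []
    n_conv = 0
    for n_layers, channels in BLOCKS:
        for _ in range(n_layers):
            size = conv_out(size)   # 3x3, stride 1, padding 1 keeps the size
            n_conv += 1
        size = pool_out(size)       # 2x2, stride 2 halves the size
        shapes.append((size, size, channels))
    return n_conv, shapes


n_conv, shapes = trace_vgg19()
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(n_conv, shapes, flattened)
```

Under these assumptions the trace gives 16 convolution layers and a 7×7×512 output after the fifth pooling layer, i.e. 25,088 values feeding the first 4096-unit fully connected layer.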

Selection of optimal features through genetic algorithm
The purpose of this step is to refine the resultant feature vector by discarding irrelevant and redundant information that may affect the performance of the proposed model and prove costly in terms of computation. Hence, a genetic algorithm (GA) [52], a stochastic search and optimization technique based on Darwin's theory of natural evolution and the principle of survival of the fittest, is employed in the proposed method. The reason for opting for GA lies in its parallelism: it can explore an entire feature space for potential solutions/features rather than exploiting a single candidate solution, which helps it avoid getting stuck in a locally optimal solution. The main stages of GA are i) selection (probabilistic), ii) crossover, and iii) mutation. Initially, the entire feature set extracted in the previous step is considered as a population that is encoded as real numbers represented as chromosomes. Individual components of chromosomes are called "genes". Afterward, a probability score is calculated for each chromosome based on the fitness value calculated through the k-nearest neighbors (kNN) classifier [53]. In the selection process, a pair of the fittest chromosomes in the population is selected through the roulette wheel selection method [54], where a slice of the wheel is assigned to each chromosome based on its probability value. A random pointer is attached to the wheel, which points to a chromosome once the wheel is rotated. As the fittest chromosomes occupy a larger slice of the wheel, their chances of being selected are higher than those of chromosomes with a minor share of the wheel. The selected pairs of chromosomes are then passed on to the crossover stage to generate a new child population. Single-point crossover is applied, in which the genes of the parent chromosomes before and after a randomly selected point are swapped to obtain a mixture of the parents' characteristics in the child chromosomes.
Moreover, to obtain child chromosomes with distinct characteristics along with inherited ones, a mutation operation is performed. The mutation operator maintains diversity by altering one or more randomly selected genes within child chromosomes, i.e., 0 to 1 and vice versa. Fig 4 shows the overall workflow of the genetic algorithm. The above operations are repeated until the population has converged and no further distinct features are produced. The final feature vector after this step can be expressed as:
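The selection, crossover, and mutation loop can be sketched as follows. This is an illustrative toy, not the authors' implementation: chromosomes are assumed to be binary masks over the features (1 = keep), fitness is assumed to be leave-one-out 1-nearest-neighbor accuracy, a fixed number of generations replaces the convergence test, and the population size, mutation rate, and synthetic data are all hypothetical.

```python
import random


def knn_fitness(mask, X, y):
    """Leave-one-out 1-NN accuracy using only features where mask == 1."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0
    correct = 0
    for a in range(len(X)):
        best_d, best_label = None, None
        for b in range(len(X)):
            if a == b:
                continue
            d = sum((X[a][i] - X[b][i]) ** 2 for i in idx)
            if best_d is None or d < best_d:
                best_d, best_label = d, y[b]
        correct += best_label == y[a]
    return correct / len(X)


def roulette_select(population, scores, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(scores)
    if total == 0:
        return rng.choice(population)
    pointer = rng.uniform(0, total)
    running = 0.0
    for chrom, score in zip(population, scores):
        running += score
        if running >= pointer:
            return chrom
    return population[-1]


def crossover(p1, p2, rng):
    """Single-point crossover: swap the tails after a random cut point."""
    point = rng.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def mutate(chrom, rate, rng):
    """Flip each gene (0 <-> 1) with a small probability."""
    return [1 - g if rng.random() < rate else g for g in chrom]


def select_features(X, y, pop_size=12, generations=15, rate=0.05, seed=0):
    rng = random.Random(seed)
    n = len(X[0])
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best_mask, best_fit = None, -1.0
    for _ in range(generations):
        scores = [knn_fitness(c, X, y) for c in population]
        for chrom, score in zip(population, scores):
            if score > best_fit:            # remember the best mask seen so far
                best_mask, best_fit = chrom[:], score
        children = []
        while len(children) < pop_size:
            a = roulette_select(population, scores, rng)
            b = roulette_select(population, scores, rng)
            for child in crossover(a, b, rng):
                children.append(mutate(child, rate, rng))
        population = children[:pop_size]
    return best_mask, best_fit


# Synthetic data: feature 0 separates the two classes, features 1-5 are noise.
data_rng = random.Random(1)
X, y = [], []
for label in (0, 1):
    for _ in range(10):
        X.append([label * 3 + data_rng.gauss(0, 0.3)]
                 + [data_rng.gauss(0, 1) for _ in range(5)])
        y.append(label)

mask, fit = select_features(X, y)
print(mask, fit)
```

In the setting of the paper the mask would span the 4096 FC-7 features rather than six toy features, and the fitness evaluation would dominate the running time.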

Image classification using ELM
In this step, to learn a model, the reduced feature set along with the labels is passed to the extreme learning machine (ELM) classifier. The ELM was first proposed by Huang et al. [55] for single hidden layer feedforward neural networks (SLFN). Instead of fine-tuning the weights of a hidden layer using traditional gradient-based methods, the parameters of hidden nodes can be initialized randomly and need not be tuned. This makes it a linear problem whose output weights can be determined easily by applying any generalized inverse operation on the hidden layer's output matrix. The schematic diagram of the ELM classifier is shown in Fig 5. For M distinct samples $(x_i, y_i) \in \mathbb{R}^{d} \times \mathbb{R}^{n}$, an ELM classifier (one output node) having $\tilde{M}$ hidden nodes and activation function $\alpha(x)$ can be modeled as follows:
$$\sum_{i=1}^{\tilde{M}} \beta_i\, \alpha(w_i \cdot x_j + b_i) = y_j, \quad j = 1, \ldots, M$$
where $\beta_i = \{\beta_1, \ldots, \beta_n\}$ is the weight vector holding the output weights between the nodes of the output layer and the i-th hidden node, and $w_i = \{w_1, \ldots, w_M\}$ is the weight vector connecting the input nodes with the i-th hidden node. $b_i$ is the threshold value of the i-th hidden node. The above equation can be represented in matrix form as
$$H\beta = Y$$
where H represents the hidden layer output matrix, which is expressed as
$$H = \begin{bmatrix} \alpha(w_1 \cdot x_1 + b_1) & \cdots & \alpha(w_{\tilde{M}} \cdot x_1 + b_{\tilde{M}}) \\ \vdots & \ddots & \vdots \\ \alpha(w_1 \cdot x_M + b_1) & \cdots & \alpha(w_{\tilde{M}} \cdot x_M + b_{\tilde{M}}) \end{bmatrix}_{M \times \tilde{M}}$$
For better generalization, ELM aims to minimize both the training error $\|H\beta - Y\|^{2}$ and the norm of the output weights $\|\beta\|$. The value of $\beta$ can be evaluated as
$$\beta = H^{\dagger} Y$$
where $H^{\dagger}$ represents the Moore-Penrose generalized inverse of the output matrix H, which can be calculated through iterative methods, singular value decomposition, or orthogonal projection methods. For a multi-class classification problem, with $h(x_i)$ denoting the i-th row of H, the objective function of ELM can be formulated as
$$\min_{\beta,\,\xi}\; \frac{1}{2}\|\beta\|^{2} + \frac{C}{2}\sum_{i=1}^{M}\|\xi_i\|^{2} \quad \text{subject to} \quad h(x_i)\beta = y_i^{T} - \xi_i^{T}, \; i = 1, \ldots, M$$
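where $\xi$ represents the training error and C is a tunable regularization parameter that balances the margin against $\xi$. While training an ELM, the following dual optimization problem, which is based on the Karush-Kuhn-Tucker (KKT) theorem, needs to be solved:
$$\mathcal{L}_{D} = \frac{1}{2}\|\beta\|^{2} + \frac{C}{2}\sum_{e=1}^{M}\|\xi_e\|^{2} - \sum_{e=1}^{M}\sum_{j=1}^{n} L_{e,j}\left(h(x_e)\beta_j - y_{e,j} + \xi_{e,j}\right)$$
where L is the Lagrange multiplier, $h(x_e)$ is the hidden-layer output for sample $x_e$ (the e-th row of H), and $\beta_j$ is the column of $\beta$ linked to the j-th output node. The corresponding optimality conditions based on KKT are as follows:
$$\frac{\partial \mathcal{L}_{D}}{\partial \beta_j} = 0 \;\Rightarrow\; \beta = H^{T} L$$
$$\frac{\partial \mathcal{L}_{D}}{\partial \xi_e} = 0 \;\Rightarrow\; L_e = C\,\xi_e, \quad e = 1, \ldots, M$$
$$\frac{\partial \mathcal{L}_{D}}{\partial L_e} = 0 \;\Rightarrow\; h(x_e)\beta - y_e^{T} + \xi_e^{T} = 0, \quad e = 1, \ldots, M$$
where $L = [L_1, L_2, \ldots, L_M]^{T}$ and $L_e = [L_{e,1}, L_{e,2}, \ldots, L_{e,n}]^{T}$. Solving these equations yields the output weights $\beta$. The final output of the ELM classifier can be expressed mathematically as follows:
$$f(x) = h(x)\beta = h(x)\, H^{T} \left(\frac{I}{C} + H H^{T}\right)^{-1} Y$$
The class label to which the pattern x belongs is determined by the index of the output node with the largest output value.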
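A compact sketch of ELM training and prediction is given below in plain Python. It is illustrative only and rests on several assumptions not stated in the text: a sigmoid activation, hidden weights drawn uniformly from [-1, 1], one-vs-all targets of +1/-1, and the regularised solution written as (I/C + HᵀH)⁻¹HᵀY, which is algebraically equivalent to Hᵀ(I/C + HHᵀ)⁻¹Y and gives a smaller linear system when there are fewer hidden nodes than samples. The toy data, `n_hidden=10`, and `C=100` are hypothetical.

```python
import math
import random


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


class ELM:
    def __init__(self, n_hidden, C=100.0, seed=0):
        self.n_hidden = n_hidden
        self.C = C
        self.rng = random.Random(seed)

    def _hidden(self, x):
        """Row h(x) of the hidden-layer output matrix H (sigmoid assumed)."""
        return [
            1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(ws, x)) + b)))
            for ws, b in zip(self.W, self.b)
        ]

    def fit(self, X, labels, n_classes):
        d = len(X[0])
        # Hidden-layer parameters are drawn at random and never tuned.
        self.W = [[self.rng.uniform(-1, 1) for _ in range(d)]
                  for _ in range(self.n_hidden)]
        self.b = [self.rng.uniform(-1, 1) for _ in range(self.n_hidden)]
        H = [self._hidden(x) for x in X]
        Y = [[1.0 if k == lab else -1.0 for k in range(n_classes)]
             for lab in labels]
        m, L = len(X), self.n_hidden
        # Regularised least squares: (I/C + H^T H) beta = H^T Y
        A = [[sum(H[s][i] * H[s][j] for s in range(m))
              + (1.0 / self.C if i == j else 0.0)
              for j in range(L)] for i in range(L)]
        B = [[sum(H[s][i] * Y[s][k] for s in range(m))
              for k in range(n_classes)] for i in range(L)]
        self.beta = solve(A, B)
        return self

    def predict(self, x):
        """Index of the output node with the largest value."""
        h = self._hidden(x)
        outputs = [sum(h[i] * self.beta[i][k] for i in range(self.n_hidden))
                   for k in range(len(self.beta[0]))]
        return max(range(len(outputs)), key=outputs.__getitem__)


# Hypothetical two-class toy data: two well-separated 2-D clusters.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [-0.1, 0.2],
     [2.0, 2.0], [2.2, 1.9], [1.8, 2.1], [2.1, 2.3]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = ELM(n_hidden=10).fit(X, labels, n_classes=2)
predictions = [model.predict(x) for x in X]
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

Training reduces to one linear solve, which is why no iterative weight updates appear anywhere in the sketch.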

Retrieval of the images
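In this step, images are retrieved from the image database by measuring the similarity between query image q and dataset images d using the Canberra distance, which is mathematically defined as follows:
$$D(q, d) = \sum_{i=1}^{k} \frac{|q_i - d_i|}{|q_i| + |d_i|}$$
where $q_i$ and $d_i$ are the i-th components of the feature vectors of q and d, and k is the length of the feature vector.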
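Ranking by Canberra distance can be sketched in a few lines. The function names, the tiny feature vectors, and the convention of skipping terms whose denominator is zero are assumptions made for illustration.

```python
def canberra(q, d):
    """Canberra distance between two equal-length feature vectors."""
    total = 0.0
    for qi, di in zip(q, d):
        denom = abs(qi) + abs(di)
        if denom:  # a 0/0 term is skipped (assumed convention)
            total += abs(qi - di) / denom
    return total


def retrieve(query, database, top_k=20):
    """Return the names of the top_k database items closest to the query."""
    ranked = sorted(database.items(), key=lambda kv: canberra(query, kv[1]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical 2-D feature vectors keyed by image name.
database = {"a": [1, 2], "b": [1, 3], "c": [5, 9]}
print(retrieve([1, 2], database, top_k=2))
```

Because every term is normalised by the magnitudes of its two components, each feature contributes at most 1 to the distance, so no single large-valued feature dominates the ranking.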

Evaluation parameters, results, and discussion
This section discusses in detail the chosen image datasets along with evaluation parameters that are used to assess the performance of the proposed method. A thorough discussion regarding attained results is also presented in subsequent sections.

Precision and recall.
Precision and recall are among the most frequently used performance measures in the CBIR framework. Precision depicts the accuracy of a system by measuring the proportion of relevant images among the images retrieved for a certain query q, whereas recall depicts the robustness of the system by measuring how many of the relevant images within a dataset are identified:
$$\text{Precision} = \frac{X_r}{X_t}, \qquad \text{Recall} = \frac{X_r}{X_{dt}}$$
where $X_r$ represents the number of images retrieved as relevant, $X_t$ represents the total number of retrieved images, and $X_{dt}$ represents the number of relevant images in the dataset.
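As a small worked example of these two ratios, the sketch below computes them from a retrieved list and the set of relevant items. The image identifiers and counts are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)    # X_r
    precision = hits / len(retrieved) if retrieved else 0.0    # X_r / X_t
    recall = hits / len(relevant) if relevant else 0.0         # X_r / X_dt
    return precision, recall


# Hypothetical query: 100 relevant images exist, 20 are retrieved, 15 are hits.
relevant_ids = range(100)
retrieved_ids = list(range(15)) + list(range(1000, 1005))
p, r = precision_recall(retrieved_ids, relevant_ids)
print(p, r)
```

With a fixed top-20 list and 100 relevant images per class, recall cannot exceed 0.20 even when every retrieved image is relevant.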

Average precision and mAP.
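The average of the precision values over a set of queries Q is known as average precision, which is calculated as:
$$AP = \frac{1}{|Q|}\sum_{q \in Q} P(q)$$
where $P(q)$ is the precision for query q. The mean of the average precision values is referred to as mAP, which is calculated as follows:
$$mAP = \frac{1}{K}\sum_{k=1}^{K} AP_k$$
where $AP_k$ is the k-th average precision value and K is the number of such values.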

F-measure.
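Another statistical measure that highlights the accuracy of a system and captures the properties of both precision and recall is called the F-measure. It is expressed mathematically as:
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$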
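The F-measure is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two values. A minimal sketch with hypothetical inputs:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f_measure(0.5, 0.5)     # equal inputs give the same value back
imbalanced = f_measure(0.75, 0.15) # dominated by the low recall
print(balanced, imbalanced)
```

For precision 0.75 and recall 0.15, the arithmetic mean would be 0.45, while the F-measure is only 0.25.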

Experimental results and discussions
The proposed image retrieval framework is evaluated on 5 image datasets which are Wang-A, Wang-B, Wang 10k, OT Scene, and Caltech-256. 70% of the images are used for training and the remaining 30% are used for testing purposes from each dataset. The subsequent sections present details of each dataset along with retrieval results.
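A 70/30 split of this kind could be produced as sketched below. Splitting within each class (so that every class keeps the same ratio) and the fixed random seed are assumptions; the text only states the overall percentages.

```python
import random


def split_per_class(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), split separately inside each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab][:]
        rng.shuffle(idxs)
        cut = round(train_fraction * len(idxs))
        train += idxs[:cut]
        test += idxs[cut:]
    return train, test


# Hypothetical label list shaped like Wang-A: 10 classes with 100 images each.
labels = [c for c in range(10) for _ in range(100)]
train_idx, test_idx = split_per_class(labels)
print(len(train_idx), len(test_idx))
```

For a dataset of this shape the sketch yields 700 training and 300 test images, with 70 and 30 per class.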

Performance assessment on the Wang-A dataset.
In the CBIR domain, Wang-A [57] is one of the widely used image collections which comprises a variety of images categorized into 10 semantic classes about 100 images for each semantic class. The resolution sizes of this image collection are 256×384 or 384×256. Fig 6 shows sample images from each class.
The experimental analysis based on precision, recall, and f-measure of the proposed method along with other competitive CBIR methods is presented in Table 2. As observed in Table 2, the proposed method exhibits the best overall performance because of a) optimal deep features obtained through the genetic algorithm instead of handcrafted features, which require considerable human effort in the feature selection process, and b) an extreme learning machine classifier, which is computationally fast compared with traditional classifiers in CBIR because of its feedforward pass approach. In many groups of the Wang-A dataset, the proposed method shows promising results and has also achieved the highest precision among all competitive methods. Fig 7 shows a query image of class "African tribe", which has distinct features, and the top-20 images that are retrieved as relevant against the query image by the proposed method. Fig 8 shows the top-20 retrieved images against a query image taken from class "Elephants". The label above each retrieved image is the classification score calculated through Canberra distance. Images having a smaller distance appear in the initial rows and are most similar in content to the query image. On some classes of this dataset, competitive methods achieve better precision/recall values than our proposed method because of the complex nature of these classes. However, the overall average precision and mean recall scores highlight the better performance of our proposed method. The performance of the proposed method has also been statistically evaluated using the nonparametric Wilcoxon matched-pairs signed-rank test, the results of which are reported in Table 2. The level of significance is set at 0.05, and the results are analyzed in terms of z-value and p-value. As the p-values against all the competitor methods along with [1] are less than the level of significance, we can conclude that the improvement achieved by the proposed method is statistically significant.
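The z-value and p-value of a Wilcoxon matched-pairs signed-rank test can be computed as sketched below. This is a simplified illustration: it uses the normal approximation without tie or continuity corrections, which is rough for small samples, and the per-class precision percentages are invented for the example rather than taken from Table 2.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Return (W+, W-, z, two-sided p) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # tied values share the average rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal approximation
    return w_plus, w_minus, z, p


# Hypothetical per-class precision (%) for two methods over 10 classes.
proposed = [90, 95, 85, 99, 92, 88, 97, 91, 94, 96]
baseline = [80, 90, 86, 90, 85, 80, 90, 88, 90, 89]
w_plus, w_minus, z, p = wilcoxon_signed_rank(proposed, baseline)
print(w_plus, w_minus, round(z, 3), round(p, 4))
```

In this invented example nine of the ten differences favour the first method, giving rank sums of 54 and 1, a z-value of about -2.70, and a p-value below the 0.05 level.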

Performance assessment on the Wang-B dataset.
The Wang-B [57] dataset is another subset of the WANG dataset and comprises 15 semantic classes having 100 images each, with a resolution of 256×384 or 384×256 pixels. Fig 9 shows sample images of each category. Table 3 highlights the achieved results of the proposed method against other competitive methods. As shown in Table 3, the proposed method attains 91.05% precision in retrieving relevant images. The statistical analysis has also shown significant results, as all the p-values are less than 0.05 when compared against competitor methods and [60]. Figs 10 and 11 show the top-20 retrieved images against query images taken from the classes "Bus" and "Tiger". On the Wang-B dataset, the retrieval time for a query image is 4.86 seconds, compared with 32.87 seconds for the approach presented by Amsa et al. [61].

Performance assessment on the Wang 10k dataset.
The Wang 10k [62] dataset comprises 10,000 images categorized into 100 categories. Each category has 100 images of size 192×128 or 128×192 pixels. Some of the categories are ships, elephants, horses, trains, cards, butterflies, roses, mountains, sunset, musical instruments, judo-karate, etc. Fig 12 shows sample images of each category. Table 4 highlights the performance of the proposed method against other competitive methods. As shown in Table 4, the proposed method attains 78.65% [...]

Table 5 shows the performance of our proposed method on the OT Scene dataset against competitive methods. Figs 16 and 17 show the top-20 images retrieved against query images that belong to the classes "open country" and "inside city". The statistical analysis has also shown significant results, as all the p-values are less than 0.05 when compared against competitor methods and [65].

Performance assessment on the Caltech 256 dataset.
The Caltech 256 [66] dataset has a total of 30,607 images categorized into 257 object categories. Each category has at least 80 images having varied resolutions. It is a challenging dataset as compared to its predecessor Caltech-101, as more variation in object size, pose, and location is considered. Some of the sample images are shown in Fig 18. Table 6 shows the better performance of the proposed method against competitive methods, as it achieves 80.95% precision. Figs 19 and 20 show the top-20 retrieved relevant images closest to the query image in terms of content. The p and z values of the nonparametric Wilcoxon matched-pairs signed-rank test have also shown the significant performance of our proposed method as compared to competitive methods as well as against [68].

Discussions of experimental results.
The reason for opting for an ELM classifier is its random, independent feature transformation and quadratic loss function, which guarantee convergence of training to a globally optimal solution [69]. Compared with traditional classifiers, it has fewer optimization constraints and better generalization capabilities [70]. One of the parameters to adjust in the ELM classifier is the number of hidden neurons, which can influence the retrieval accuracy of the proposed system. The reported retrieval accuracy is achieved when the number of hidden neurons is in the range of 200-300. The retrieval accuracy fluctuates within this range but gradually starts to increase when the number of neurons is set to 1000 or more. Accuracy on the Wang-B and OT datasets tends to decrease at points where the number of hidden neurons is equivalent to the number of training images. Fig 22 shows that, as the number of retrieved images increases, precision remains the same for most of the chosen datasets, whereas recall increases, highlighting the effective performance of our proposed method.
The limitations of handcrafted approaches mentioned in sections 1-2, such as limited image-expressing capabilities and expensive design, are addressed by utilizing a VGG-19 architecture that can learn features automatically. As the feature vector obtained from the FC-7 layer of the network is of high dimension, a dimension reduction strategy is needed that not only selects the important features but is also computationally efficient while classifying the images. To address this, the genetic algorithm incorporated in the proposed approach both selects the optimal features and reduces the feature vector size. The resultant feature vector is approximately half the dimension of the original feature vector. For classification, ELM, being a single hidden layer feedforward neural network, works better in terms of precision, recall, f-measure, and retrieval time as compared to handcrafted methods of CBIR.

Required resources and comparative analysis of computational cost.
The hardware and software resources on which the performance of the proposed method is assessed are as follows: a PC having an Intel Core i7-7700 3.60 GHz processor, 8 GB RAM, Microsoft [...] This eases the feature engineering task; it is also invariant to scale, rotation, and translation and computationally less expensive as compared with competitive image retrieval methods. The performance comparison in terms of computational cost (retrieval time) of the proposed method and its competitive methods on the Wang-A dataset is presented in Table 7, and Table 8 presents the retrieval time of the proposed method on the Caltech-256 dataset.

Conclusion and future work
The most important factors for an image retrieval system to be termed efficient and accurate are its retrieval accuracy and its utilization of computational resources. Reducing the feature vector dimensionality or extracting the appropriate features can influence both factors. The proposed method therefore first extracts features through the VGG-19 architecture, which results in a 4096-dimensional vector. Not all of these extracted features are useful, and they can consume more resources and time during execution. Hence, irrelevant and redundant features are discarded by utilizing a genetic algorithm. The proposed method uses an ELM classifier because it is computationally fast and easily trained. Classification results over 5 datasets show that the proposed method has the highest precision and recall rates among the competitive CBIR methods. In the future, we will explore other deep architectures and different versions of the ELM classifier to enhance the CBIR process.